SOUND SOURCE SEPARATION USING CONVOLUTIONAL MIXING 
AND A PRIORI SOUND SOURCE KNOWLEDGE 

RELATED APPLICATIONS 

This application claims the benefit of and priority to the previously filed 
provisional patent application entitled "Speech/Noise Separation Using Two
Microphones and a Model of Speech Signals," filed on April 26, 2000, and assigned
serial number 60/199,782. 

FIELD OF THE INVENTION 

The invention relates generally to sound source separation, and more particularly 
to sound source separation using a convolutional mixing model.

BACKGROUND OF THE INVENTION 

Sound source separation is the process of separating two or more sound sources into
separate signals, given at least as many recorded microphone signals. For
example, within a conference room, there may be five different people talking, and five
microphones placed around the room to record their conversations. In this instance,
sound source separation involves separating the five recorded microphone signals into a
signal for each of the speakers. Sound source separation is used in a number of different
applications, such as speech recognition. For example, in speech recognition, the
speaker's voice is desirably isolated from any background noise or other speakers, so that
the speech recognition process uses the cleanest signal possible to determine what the
speaker is saying.

The diagram 100 of FIG. 1 shows an example environment in which sound source
separation may be used. The voice of the speaker 104 is recorded by a number of
differently located microphones 106, 108, 110, and 112. Because the microphones are
located at different positions, they will record the voice of the speaker 104 at different
times, at different volume levels, and with different amounts of noise. The goal of the
sound source separation in this instance is to isolate in a single signal just the voice of the
speaker 104 from the recorded microphone signals. Typically, the speaker 104 is
modeled as a point source, although it is more diffuse in reality. Furthermore, the
microphones 106, 108, 110, and 112 can be said to make up a microphone array. The
pickup pattern of FIG. 1 tends to be less selective at lower frequencies.

One approach to sound source separation is to use a microphone array in 
combination with the response characteristics of each microphone. This approach is 
referred to as delay-and-sum beamforming. For example, a particular microphone may 
have the pickup pattern 200 of FIG. 2. The microphone is located at the intersection of
the x axis 210 and the y axis 212, which is the origin. The lobes 202, 204, 206, and 208
indicate where the microphone is most sensitive. That is, the lobes indicate where the 
microphone has the greatest response, or gain. For example, the microphone modeled by 
the graph 200 has the greatest response where the lobe 202 intersects with the y axis 212 
in the negative y direction. 

By using the pickup pattern of each microphone, along with the location of each
microphone relative to the fixed position of the speaker, delay-and-sum beamforming can
be used to separate the speaker's voice as an isolated signal. This is because the
incidence angle between each microphone and the speaker can be determined a priori, as
well as the relative delay with which the microphones will pick up the speaker's voice, and
the degree of attenuation of the speaker's voice when each microphone records it.
Together, this information is used to separate the speaker's voice as an isolated signal.
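
As a rough illustration of the delay-and-sum idea described above, the following Python sketch aligns each microphone signal by a known delay and averages the aligned signals. It is only a minimal sketch under simplifying assumptions that are not part of this description: the delays are whole numbers of samples, all microphones have equal gain, and no reverberation is present. The function name `delay_and_sum` and its parameters are hypothetical.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Align each microphone signal by its known delay (in samples) and average.

    mic_signals: list of 1-D arrays of equal length, one per microphone.
    delays: list of non-negative integer delays, one per microphone,
            assumed to be known a priori from the array geometry.
    """
    # Number of samples that remain available after the largest shift.
    length = len(mic_signals[0]) - max(delays)
    # Advance each signal by its own delay so that the source lines up.
    aligned = [np.asarray(sig)[d:d + length] for sig, d in zip(mic_signals, delays)]
    # Averaging reinforces the aligned source relative to uncorrelated noise.
    return np.mean(aligned, axis=0)
```

With noise-free, purely delayed copies of one source, the output reproduces the source; with reverberation, the single-delay assumption no longer holds, which is the limitation discussed next.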

However, the delay-and-sum beamforming approach to sound source separation is
useful primarily in soundproof rooms and other near-ideal environments where no
reverberation is present. Reverberation, or "reverb," is the bouncing of sound waves off
surfaces such as walls, tables, and windows. Delay-and-sum
beamforming assumes that no reverb is present. Where reverb is present, which is
typically the case in most real-world situations where sound source separation is desired,
this approach loses much of its accuracy.

An example of reverb is depicted in the graph 300 of FIG. 3. The graph 300 
depicts the sound signals picked up by a microphone over time, as indicated by the time 
axis 302. The volume axis 304 indicates the relative amplitude of the volume of the 
signals recorded by the microphone. The original signal is indicated as the signal 306.
Two reverberations are shown as a first reverb signal 308, and a second reverb signal
310. The presence of the reverb signals 308 and 310 limits the accuracy of the sound 
source separation using the delay-and-sum beamforming approach. 

Another approach to sound source separation is known as independent component
analysis (ICA) in the context of instantaneous mixing. This technique is also referred to
as blind source separation (BSS). BSS means that no information regarding the sound
sources is known a priori, apart from their assumed mutual statistical independence. In
laboratory conditions, ICA in the context of instantaneous mixing achieves signal
separation up to a permutation limitation. That is, the approach can separate the sound
sources correctly, but cannot identify which output signal is the first sound source, which
is the second sound source, and so on. However, BSS also fails in real-world conditions
where reverberation is present, since it does not take into account reverb of the sound
sources.

Mathematically, ICA for instantaneous mixing assumes that R microphone
signals, $y_i[n]$, $\mathbf{y}[n] = (y_1[n], y_2[n], \ldots, y_R[n])$, are obtained by a linear combination of R
sound source signals $x_i[n]$, $\mathbf{x}[n] = (x_1[n], x_2[n], \ldots, x_R[n])$. This is written as:

$$\mathbf{y}[n] = \mathbf{V}\mathbf{x}[n] \tag{1}$$

for all n, where $\mathbf{V}$ is the RxR mixing matrix. The mixing is instantaneous in that the
microphone signals at any time n depend on the sound source signals at the same time,
but at no earlier time. In the absence of any information about the mixing, the BSS
problem estimates a separating matrix $\mathbf{W} = \mathbf{V}^{-1}$ from the recorded microphone signals
alone. The sound source signals are recovered by:

$$\hat{\mathbf{x}}[n] = \mathbf{W}\mathbf{y}[n]. \tag{2}$$

A criterion is selected to estimate the unmixing matrix $\mathbf{W}$. One solution is to use
the probability density function (pdf) of the source signals, $p_x(\mathbf{x}[n])$, such that the pdf of
the recorded microphone signals is:

$$p_y(\mathbf{y}[n]) = |\mathbf{W}|\, p_x(\mathbf{W}\mathbf{y}[n]). \tag{3}$$

Because each sound source signal $\mathbf{x}[n]$ is assumed to be independent of itself over
time, that is, of $\mathbf{x}[n+i]$, $i \neq 0$, the joint probability is:

$$\Psi = p_y(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \prod_{n=0}^{N-1} p_y(\mathbf{y}[n]) = |\mathbf{W}|^N \prod_{n=0}^{N-1} p_x(\mathbf{W}\mathbf{y}[n]). \tag{4}$$

The gradient of the normalized log-likelihood, $\frac{1}{N}\ln\Psi$, with respect to $\mathbf{W}$ is:

$$\frac{\partial}{\partial \mathbf{W}}\left(\frac{1}{N}\ln\Psi\right) = \left(\mathbf{W}^T\right)^{-1} + \frac{1}{N}\sum_{n=0}^{N-1} \phi(\mathbf{W}\mathbf{y}[n])\,(\mathbf{y}[n])^T \tag{5}$$

where $\phi(\mathbf{x})$ is:

$$\phi(\mathbf{x}) = \frac{\partial \ln p_x(\mathbf{x})}{\partial \mathbf{x}}. \tag{6}$$
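
The gradient of equation (5) can be turned into a simple iterative update of the separating matrix. The Python sketch below is one hedged illustration, not the definitive procedure: it assumes the smooth prior $p_x(x) = 1/(\pi\cosh x)$ (chosen here only because it makes $\phi(x) = -\tanh(x)$ easy to write), a fixed step size, and a fixed number of plain gradient steps. The names `log_prior`, `phi`, `objective`, `gradient`, and `infomax` are hypothetical.

```python
import numpy as np

def log_prior(x):
    """ln p_x(x) for the assumed prior p_x(x) = 1 / (pi * cosh(x))."""
    return -(np.logaddexp(x, -x) - np.log(2.0)) - np.log(np.pi)

def phi(x):
    """Equation (6) for the assumed prior: d ln p_x / dx = -tanh(x)."""
    return -np.tanh(x)

def objective(W, Y):
    """Normalized log-likelihood: ln|det W| + (1/N) sum_n ln p_x(W y[n])."""
    return np.log(abs(np.linalg.det(W))) + log_prior(W @ Y).sum(axis=0).mean()

def gradient(W, Y):
    """Equation (5): (W^T)^-1 + (1/N) sum_n phi(W y[n]) y[n]^T."""
    N = Y.shape[1]
    return np.linalg.inv(W.T) + phi(W @ Y) @ Y.T / N

def infomax(Y, steps=200, lr=0.02):
    """Plain gradient steps on W, starting from the identity matrix.

    Y has shape (R, N): one row per microphone, one column per sample.
    """
    W = np.eye(Y.shape[0])
    for _ in range(steps):
        W = W + lr * gradient(W, Y)  # move W in the direction that raises the likelihood
    return W
```

Given a mixture `Y = V @ X`, the estimated sources follow equation (2) as `X_hat = infomax(Y) @ Y`, recovered only up to the scaling and permutation ambiguity discussed above.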

From equations (4), (5), and (6), a gradient descent solution, known as the
infomax rule, can be obtained for $\mathbf{W}$ given $p_x(\mathbf{x})$. That is, given the probability density
function of the sound source signals, the separating matrix $\mathbf{W}$ can be obtained. The
density function $p_x(\mathbf{x})$ may be Gaussian, Laplacian, a mixture of Gaussians, or another
type of prior, depending on the degree of separation desired. For example, a Laplacian
prior or a mixture of Gaussian priors generally yields better separation of the sound
source signals from the recorded microphone signals than a Gaussian prior does.

As has been indicated, however, although the ICA approach in the context of
instantaneous mixing does achieve sound source signal separation in environments where
reverberation is non-existent, the approach is unsatisfactory where reverb is present.
Because reverb is present in most real-world situations, therefore, the instantaneous
mixing ICA approach is limited in its practicality. An approach that does take into
account reverberation is known as convolutional mixing ICA. Convolutional mixing
takes into consideration the transfer functions between the sound sources and the
microphones created by environmental acoustics. By considering environmental
acoustics, convolutional mixing thus takes into account reverberation.

The primary disadvantage to convolutional mixing ICA is that, because it operates
in the frequency domain instead of in the time domain, the permutation limitation of ICA
occurs on a per-frequency component basis. This means that the reconstructed sound
source signals may have frequency components belonging to different sound sources,
resulting in incomprehensible reconstructed signals. For example, in the diagram 400 of
FIG. 4, the output sound source signal 402 is reconstructed by convolutional mixing ICA
from two sound source signals, a first sound source signal 404, and a second sound source
signal 406. Each of the signals 402, 404, and 406 has a frequency spectrum from a low
frequency $f_L$ to a high frequency $f_H$. The output signal 402 is meant to reconstruct either
the first signal 404 or the second signal 406.

However, in actuality, the first frequency component 408 of the output signal 402
is that of the second signal 406, and the second frequency component 410 of the output
signal 402 is that of the first signal 404. That is, rather than the output signal 402 having
the first and the second components 412 and 410 of the first signal 404, or the first and
the second components 408 and 414 of the second signal 406, it has the first component
408 from the second signal 406, and the second component 410 from the first signal 404.
To the human ear, and for applications such as speech recognition, the reconstructed
output sound source signal 402 is meaningless.

Mathematically, convolutional mixing ICA is described with respect to two sound
sources and two microphones, although the approach can be extended to any number R of
sources and microphones. An example environment is shown in the diagram 500 of FIG.
5, in which the voices of a first speaker 502 and a second speaker 504 are recorded by a
first microphone 506 and a second microphone 508. The first speaker 502 is represented
as the point sound source $x_1[n]$, and the second speaker 504 is represented as the point
sound source $x_2[n]$. The first microphone 506 records the microphone signal $y_1[n]$,
whereas the second microphone 508 records the microphone signal $y_2[n]$. The input
signals $x_1[n]$ and $x_2[n]$ are said to be filtered with filters $g_{ij}[n]$ to generate the
microphone signals, where the filters $g_{ij}[n]$ take into account the position of the
microphones, room acoustics, and so on. Reconstruction filters $h_{ij}[n]$ are then applied to
the microphone signals $y_1[n]$ and $y_2[n]$ to recover the original input signals, as the
output signals $\hat{x}_1[n]$ and $\hat{x}_2[n]$.

This model is shown in the diagram 600 of FIG. 6. The voice of the first speaker
502, $x_1[n]$, is affected by environmental and other factors indicated by the filters 602a
and 602b, represented as $g_{11}[n]$ and $g_{12}[n]$. The voice of the second speaker 504, $x_2[n]$,
is affected by environmental and other factors indicated by the filters 602c and 602d,
represented as $g_{21}[n]$ and $g_{22}[n]$. The first microphone 506 records a microphone signal
$y_1[n]$ equal to $x_1[n] * g_{11}[n] + x_2[n] * g_{21}[n]$, where $*$ represents the convolution operator
defined as:

$$y[n] = x[n] * h[n] = \sum_{m=-\infty}^{\infty} x[m]\,h[n-m].$$

The second microphone 508 records a
microphone signal $y_2[n]$ equal to $x_2[n] * g_{22}[n] + x_1[n] * g_{12}[n]$. The first microphone
signal $y_1[n]$ is input into the reconstruction filters 604a and 604b, represented by $h_{11}[n]$ and
$h_{12}[n]$. The second microphone signal $y_2[n]$ is input into the reconstruction filters 604c
and 604d, represented by $h_{21}[n]$ and $h_{22}[n]$. The reconstructed source signal 502' is
determined by solving $\hat{x}_1[n] = y_1[n] * h_{11}[n] + y_2[n] * h_{21}[n]$. Similarly, the reconstructed
source signal 504' is determined by solving $\hat{x}_2[n] = y_2[n] * h_{22}[n] + y_1[n] * h_{12}[n]$.
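
The two-source, two-microphone model just described can be sketched in a few lines of Python. This is an illustrative sketch only: the filters are short FIR arrays, outputs are truncated to the input length, and the function names `mix` and `unmix` are hypothetical.

```python
import numpy as np

def mix(x1, x2, g11, g12, g21, g22):
    """Convolutional mixing: y1 = x1*g11 + x2*g21 and y2 = x2*g22 + x1*g12."""
    n = len(x1)
    y1 = np.convolve(x1, g11)[:n] + np.convolve(x2, g21)[:n]
    y2 = np.convolve(x2, g22)[:n] + np.convolve(x1, g12)[:n]
    return y1, y2

def unmix(y1, y2, h11, h12, h21, h22):
    """Reconstruction: x1_hat = y1*h11 + y2*h21 and x2_hat = y2*h22 + y1*h12."""
    n = len(y1)
    x1_hat = np.convolve(y1, h11)[:n] + np.convolve(y2, h21)[:n]
    x2_hat = np.convolve(y2, h22)[:n] + np.convolve(y1, h12)[:n]
    return x1_hat, x2_hat
```

In the special case where every mixing filter is a single tap, the mixing is instantaneous and single-tap reconstruction filters taken from the inverse of the 2x2 gain matrix recover the sources exactly; longer mixing filters model delays and reverberation.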

The reconstruction filters 604a, 604b, 604c, and 604d, or $h_{ij}[n]$, completely
recover the original signals of the speakers 502 and 504, or $x_i[n]$, if and only if their
z-transforms are the inverse of the z-transforms of the mixing filters 602a, 602b, 602c, and
602d, or $g_{ij}[n]$. Mathematically, this is:

$$\begin{pmatrix} H_{11}(z) & H_{12}(z) \\ H_{21}(z) & H_{22}(z) \end{pmatrix} = \begin{pmatrix} G_{11}(z) & G_{12}(z) \\ G_{21}(z) & G_{22}(z) \end{pmatrix}^{-1} = \frac{1}{G_{11}(z)G_{22}(z) - G_{12}(z)G_{21}(z)} \begin{pmatrix} G_{22}(z) & -G_{12}(z) \\ -G_{21}(z) & G_{11}(z) \end{pmatrix} \tag{7}$$

The mixing filters 602a, 602b, 602c, and 602d, or $g_{ij}[n]$, can be assumed to be
finite impulse response (FIR) filters, having a length that depends on environmental and
other factors. These factors may include room size, microphone position, wall
absorbance, and so on. This means that the reconstruction filters 604a, 604b, 604c, and
604d, or $h_{ij}[n]$, have an infinite impulse response. Since using an infinite number of
coefficients is impractical, the reconstruction filters are assumed to be FIR filters of
length q, which means that the original signals from the speakers 502 and 504, $x_i[n]$, will
not be recovered exactly as $\hat{x}_i[n]$. That is, $\hat{x}_i[n] \neq x_i[n]$, but $\hat{x}_i[n] \approx x_i[n]$.
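
The effect of truncating an infinite impulse response to q taps can be seen in a deliberately simplified, single-channel Python sketch. It assumes a hypothetical two-tap mixing filter $G(z) = 1 - \rho z^{-1}$, whose exact inverse has the impulse response $\rho^k$ for $k \geq 0$; the 2x2 case of equation (7) behaves analogously but is not shown. The function name `truncated_inverse` is hypothetical.

```python
import numpy as np

def truncated_inverse(rho, q):
    """First q taps of the inverse of G(z) = 1 - rho z^-1, that is h[k] = rho**k."""
    return rho ** np.arange(q)

g = np.array([1.0, -0.5])              # hypothetical FIR mixing filter
for q in (4, 8, 16):
    h = truncated_inverse(0.5, q)      # FIR approximation of the IIR inverse
    cascade = np.convolve(g, h)        # would be a unit impulse for an exact inverse
    residual = np.abs(cascade[1:]).max()
    print(q, residual)                 # the leftover error shrinks as q grows
```

The cascade of the mixing filter and its truncated inverse is a unit impulse plus one small trailing tap, which is why the recovered signal is close to, but not exactly equal to, the original.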

The convolutional mixing ICA approach achieves sound separation by estimating
the reconstruction filters $h_{ij}[n]$ from the microphone signals $y_j[n]$ using the infomax
rule. Reverberation is accounted for, as well as other arbitrary transfer functions.
However, estimation of the reconstruction filters $h_{ij}[n]$ using the infomax rule still
represents a less than ideal approach to sound separation, because, as has been
mentioned, permutations can occur on a per-frequency component basis in each of the
output signals $\hat{x}_i[n]$. Whereas the BSS and instantaneous mixing ICA approaches
achieve proper sound separation but cannot take into account reverb, the convolutional
mixing infomax ICA approach can take into account reverb but achieves improper sound
separation.

For these and other reasons, therefore, there is a need for the present invention.

SUMMARY OF THE INVENTION 

This invention uses reconstruction filters that take into account a priori knowledge
of the sound source signal desired to be separated from the other sound source signals to
achieve separation without permutation when performing convolutional mixing
independent component analysis (ICA). For example, the sound source signal desired to
be separated from the other sound source signals, referred to as the target sound source
signal, may be human speech. In this case, the reconstruction filters may be constructed
based on an estimate of the spectra of the target sound source signal. A hidden Markov
model (HMM) speech recognition system can be employed to determine whether a
reconstructed signal is properly separated human speech. The reconstructed signal is
matched against the words of the dictionary of the speech recognition system. A high
probability match to one of the dictionary's words indicates that the reconstructed signal
is properly separated human speech.

Alternatively, a vector quantization (VQ) codebook of vectors may be employed
to determine whether a reconstructed signal is properly separated human speech. The
vectors may be linear prediction (LPC) vectors or other types of vectors extracted from
the input signal. The vectors specifically represent human speech patterns typical of the
target sound source signal, and generally represent sound source patterns typical of the
target sound source signal. The reconstructed signal is matched against the vectors, or
code words, of the codebook. A high probability match to one of the codebook's vectors
indicates that the reconstructed signal is properly separated human speech. The VQ
codebook approach requires a significantly smaller number of speech patterns than the
number of words in the dictionary of a speech recognition system. For example, there
may be only sixteen or 256 vectors in the codebook, whereas there may be tens of
thousands of words in the dictionary of a speech recognition system.

By employing a priori knowledge of the target sound source signal, the invention
overcomes the disadvantages associated with the convolutional mixing infomax ICA
approach as found in the prior art. Convolutional mixing ICA according to the invention
generates reconstructed signals that are separated, and not merely decorrelated. That is,
the invention allows convolutional mixing ICA without permutation, because the a priori
knowledge of the target sound source signal ensures that frequency components of the
reconstructed signals are not permutated. The a priori knowledge of the target sound
source signal itself is encapsulated in the reconstruction filters, and is represented in the
words of the speech recognition system's dictionary or the patterns of the VQ codebook.
Other advantages, aspects, and embodiments of the invention will become apparent by
reading the detailed description, and referring to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram of an example environment in which sound source separation
may be used.

FIG. 2 is a diagram of an example response, or gain, graph of a microphone.

FIG. 3 is a diagram showing an example of reverberation.

FIG. 4 is a diagram showing how convolutional mixing independent component
analysis (ICA) can generate reconstructed signals exhibiting permutation on a
per-frequency component basis.

FIG. 5 is a diagram of an example environment in which sound source separation
via convolutional mixing ICA can be used.

FIG. 6 is a diagram showing an example model of convolutional mixing ICA.

FIG. 7 is a flowchart of a method showing the general approach of the invention 
to achieve sound source separation. 

FIG. 8 is a flowchart of a method showing the cepstral approach used by one
embodiment to construct the reconstruction filters employed in sound source separation.

FIG. 9 is a flowchart of a method showing the vector quantization (VQ) codebook 
approach used by one embodiment to construct the reconstruction filters employed in 
sound source separation. 

FIG. 10 is a flowchart of a method outlining the expectation maximization (EM)
algorithm.

FIG. 11 is a diagram of an example computing device in conjunction with which
the invention may be implemented.

DETAILED DESCRIPTION OF THE INVENTION 

In the following detailed description of exemplary embodiments of the invention,
reference is made to the accompanying drawings that form a part hereof, and in which is
shown by way of illustration specific exemplary embodiments in which the invention
may be practiced. These embodiments are described in sufficient detail to enable those
skilled in the art to practice the invention. Other embodiments may be utilized, and
logical, mechanical, electrical, and other changes may be made without departing from
the spirit or scope of the present invention. The following detailed description is,
therefore, not to be taken in a limiting sense, and the scope of the present invention is
defined only by the appended claims.

General Approach

FIG. 7 shows a flowchart 700 of the general approach followed by the invention
to achieve sound source separation. The target sound source is the voice of the speaker
502, which is also referred to as the first sound source. Other sound sources are grouped
into a second sound source 706. The second sound source 706 may be the voice of
another speaker, such as the speaker 504, music, or other types of sound and noise that
are not desired in the output sound source signals. Each of the first sound source 502 and
the second sound source 706 is recorded by the microphones 506 and 508. The
microphones 506 and 508 are used to produce microphone signals (702). The
microphones are referred to generally as sound input devices.

The microphone signals are then subjected to unmixing filters (704) to yield the
output sound source signals 502' and 706'. The first output sound source signal 502' is
the reconstruction of the first sound source, the voice of the speaker 502. The second
output sound source signal 706' is the reconstruction of the second sound source 706.
The unmixing filters are applied in 704 according to a convolutional mixing independent
component analysis (ICA), which was generally described in the background section.
However, the inventive unmixing filters have two differences and advantages. First, it
does not need to be assumed that a sound source is independent from itself over time.
That is, the sound source may exhibit correlation over time. Second, an estimate of the spectrum of the
sound source signal that is desired is obtained a priori. This guides decorrelation such
that signal separation occurs.

That is, a priori sound source knowledge allows the convolutional mixing ICA of
the invention to reach sound source separation, and not just sound source permutation.
The permutation on a per-frequency component basis shown as a disadvantage of
convolutional mixing infomax ICA in FIG. 4 is avoided by basing the unmixing filters on
an a priori estimate of the spectrum of the sound source signal. The permutation
limitation of convolutional mixing infomax ICA is removed, allowing complete
separation and decorrelation of the output sound source signals. Otherwise, the inventive
approach to convolutional mixing ICA can be the same as that described in the
background section, such that, for example, FIGs. 5 and 6 can depict embodiments of the
invention.

For example, reverberation and other acoustical factors can be present when
recording the microphone signals, without a significant loss of accuracy of the resulting
separation. Such factors, generally referred to as acoustical factors, are implicitly
depicted in the mixing filters 602a, 602b, 602c, and 602d of FIG. 6. Furthermore, the
unmixing filters 604a, 604b, 604c, and 604d of FIG. 6 also depict the inventive unmixing
filters, where the inventive filters have the added limitation that they are based on
knowledge of the desired target sound source signal.

The general approach of FIG. 7 shows two input sound sources, with one of the
sound sources being a target sound source that is the voice of a human speaker. This is
for example purposes only, however. There can be more than two sound sources, so long
as there are at least as many microphones as sound sources. Furthermore, the target
sound source may be other than the voice of a human speaker, so long as the unmixing
filters are based on a priori knowledge of the type of sound source being targeted for
separation purposes.

Speech Recognition Approach 

To construct separation, or unmixing or reconstruction, filters based on
knowledge of the type of sound source being targeted, one embodiment utilizes
commonly available speech recognition systems where the target sound source is human
speech. A speech recognition system is used to indicate whether a given decorrelated
signal is a proper separated signal, or an improper permutated signal. This approach is
also referred to as the cepstral approach, in that word matching is accomplished to
determine the most likely word to which the decorrelated signal corresponds.

Mathematically, the reconstruction filters are assumed to be finite impulse
response (FIR) filters of length q. Although this means that the original sound source
signals $x_1[n]$ and $x_2[n]$ will not be exactly recovered, this is not disadvantageous. The
target speech signal is represented as $x_1[n]$, whereas the second signal $x_2[n]$ represents
all other sound, collectively called interference. Without loss of generality, an estimate
of the desired output signal $\hat{x}_1[n]$ is:

$$\hat{x}_1[n] = h_1[n] * y_1[n] + h_2[n] * y_2[n] = \sum_{l=0}^{q-1} h_1[l]\,y_1[n-l] + \sum_{l=0}^{q-1} h_2[l]\,y_2[n-l]. \tag{8}$$

Using the notation introduced in the background section, $h_{ij}[n]$ represents the
reconstruction filters. Where h has only a single subscript, this means that the filter being
represented is one of the filters corresponding to the desired output signal. For example,
$h_1[n]$ is shorthand for $h_{11}[n]$, where the desired output signal is $\hat{x}_1[n]$. Similarly, $h_2[n]$
is shorthand for $h_{21}[n]$, where the desired output signal is $\hat{x}_1[n]$. The recorded
microphone signals are again represented by $y_1[n]$ and $y_2[n]$.

Two vectors are next introduced:

$$\mathbf{h}_1 = (h_1[0], h_1[1], \ldots, h_1[q-1])^T$$
$$\mathbf{h}_2 = (h_2[0], h_2[1], \ldots, h_2[q-1])^T. \tag{9}$$

The M sample microphone signals for i = 1, 2 are represented as the vector:

$$\mathbf{y}_i = \{y_i[0], y_i[1], \ldots, y_i[M-1]\}. \tag{10}$$

A typical speech recognition system finds the word sequence $\hat{W}$ that maximizes
the probability given a model $\lambda$ and an input signal s[n]:

$$\hat{W} = \arg\max_{W} p(W \mid \lambda, s[n]). \tag{11}$$

The cepstral approach to constructing unmixing filters is depicted in the flowchart
800 of FIG. 8. To accomplish speech recognition of the reconstructed signal
$\hat{x}_1[n] = \{\hat{x}_1[0], \hat{x}_1[1], \ldots, \hat{x}_1[M-1]\}$, the maximum a posteriori (MAP) estimate is found
(802) by summing over all possible word strings W within the dictionary of the speech
recognition system, and all possible filters $\mathbf{h}_1$ and $\mathbf{h}_2$:

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} p(\mathbf{x} \mid \mathbf{y}_1, \mathbf{y}_2) = \arg\max_{\mathbf{x}} \sum_{W, \mathbf{h}_1, \mathbf{h}_2} p(\mathbf{x}, W, \mathbf{h}_1, \mathbf{h}_2 \mid \mathbf{y}_1, \mathbf{y}_2)$$
$$\approx \arg\max_{\mathbf{x}} \max_{W} \max_{\mathbf{h}_1, \mathbf{h}_2} p(\mathbf{y}_1, \mathbf{y}_2 \mid \mathbf{x}, \mathbf{h}_1, \mathbf{h}_2)\, p(W \mid \mathbf{x})\, p(\mathbf{h}_1, \mathbf{h}_2). \tag{12}$$

$\hat{\mathbf{x}}$ is shorthand for $\hat{\mathbf{x}}_1$, and $\mathbf{x}$ is shorthand for $\mathbf{x}_1$. Equation (12) uses the known Viterbi
approximation, assuming that the sum is dominated by the most likely word string W and
the most likely filters. Further, if it is assumed that there is no additive noise, which is
the case in FIG. 6, then $p(\mathbf{y}_1, \mathbf{y}_2 \mid \mathbf{x}, \mathbf{h}_1, \mathbf{h}_2)$ is a delta function. Equation (12) thus finds
the most likely words in the speech recognition system that match the microphone
signals. As a result, this approach can be referred to as the cepstral approach.

In the absence of prior information for the reconstruction filters, the approximate
MAP filter estimates are:

$$(\hat{\mathbf{h}}_1, \hat{\mathbf{h}}_2) = \arg\max_{\mathbf{h}_1, \mathbf{h}_2} \left\{ \arg\max_{W} p(W \mid \hat{\mathbf{x}}) \right\}. \tag{13}$$

These filter estimates encapsulate the a priori knowledge of the signal $\hat{\mathbf{x}}$, specifically that
the input signal is human speech. The MAP filter estimates are then employed within
a standard known hidden Markov model (HMM)-based speech recognition system (804
of FIG. 8). The reconstructed input signal $\hat{\mathbf{x}}$ is usually decomposed into T frames $\hat{\mathbf{x}}^t$ of
length N samples each:

$$\hat{x}^t[n] = \hat{x}[tN + n], \tag{14}$$

so that the inner term in equation (13) can be expressed as:

$$\arg\max_{W} p(W \mid \hat{\mathbf{x}}) = \prod_{t=0}^{T-1} \sum_{k=0}^{K-1} \gamma_t[k]\, p(k \mid \hat{\mathbf{x}}^t), \tag{15}$$

where $\gamma_t[k]$ is the a posteriori probability of frame t belonging to Gaussian k, which is
one of K Gaussians in the HMM. Large vocabulary systems can often use on the order of
100,000 Gaussians.

The term $p(k \mid \hat{\mathbf{x}}^t)$ in equation (15), as used in most HMM speech recognition
systems, includes what are known as cepstral vectors, resulting in a nonlinear equation,
which is solved to obtain the actual reconstruction filters (806 of FIG. 8). Solving this equation
may be computationally prohibitive, especially for small devices such as wireless phones
and personal digital assistant (PDA) devices that do not have adequate computational
power. Therefore, another approach is described next that approximates the cepstral
approach and results in a more mathematically tractable solution.

Vector Quantization (VQ) Codebook of Linear Prediction (LPC) Vectors Approach

To construct reconstruction filters based on knowledge of the type of sound
source being targeted, a further embodiment approximates the speech recognition
approach of the previous section of the detailed description. Rather than the word
matching of the previous embodiment's approach, this embodiment focuses on pattern
matching. More specifically, rather than determining the probability that a given
decorrelated signal is a particular word, this approach determines the probability that a
given decorrelated signal is one of a number of speech-type spectra. A codebook of
speech-type spectra is used, such as sixteen or 256 different spectra. If there is a high
probability that a given decorrelated signal is one of these spectra, then this corresponds
to a high probability that the signal is a separated signal.

The approximation of this approach uses an autoregressive (AR) model instead of
a cepstral model. A vector quantization (VQ) codebook of linear prediction (LPC)
vectors is used to determine the linear prediction (LPC) error of each of the number of
speech-type spectra. Because this model is linear in the time domain, it is more
computationally tractable than the cepstral approach, and therefore can potentially be
used in less computationally powerful devices. Only a small group of different
speech-type spectra needs to be stored, instead of an entire speech recognition system
vocabulary. The error that is predicted is small for decorrelated signals that correspond
to separated signals containing human speech. The VQ codebook of vectors encapsulates
a priori knowledge regarding the desired target input signal.

The VQ codebook of LPC vectors approach to constructing unmixing filters is
depicted in the flowchart 900 of FIG. 9. Mathematically, the LPC error of class k for
signal $\hat{x}^t[n]$ is first defined (902), as:

$$e_t^k[n] = \sum_{i=0}^{p} a_i^k\, \hat{x}^t[n-i], \tag{16}$$

where i = 0, 1, 2, ..., p, and $a_0^k = 1$. The average energy of the prediction error for the
frame t is defined as:

$$E_t^k = \frac{1}{N} \sum_{n=0}^{N-1} \left| e_t^k[n] \right|^2. \tag{17}$$
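
The prediction-error energy of equations (16) and (17), and the selection of the codeword that gives the smallest energy for a frame, can be sketched as follows. This is a minimal Python illustration under stated assumptions: samples before the start of the frame are treated as zero, each codeword is an array of LPC coefficients beginning with 1, and the function names `lpc_error_energy` and `best_codeword` are hypothetical.

```python
import numpy as np

def lpc_error_energy(frame, a):
    """Energy of e[n] = sum_i a[i] * frame[n - i], averaged over the frame.

    Samples before the start of the frame are treated as zero (an assumption).
    """
    e = np.convolve(frame, a)[:len(frame)]   # prediction error, equation (16)
    return np.mean(e ** 2)                   # average energy, equation (17)

def best_codeword(frame, codebook):
    """Return the index of the codeword with the smallest error energy, and all energies."""
    energies = [lpc_error_energy(frame, a) for a in codebook]
    return int(np.argmin(energies)), energies
```

A frame whose spectrum resembles one of the codebook entries yields a small error energy for that entry, which is the property the approach relies on to recognize properly separated speech.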

The probability for each class can be an exponential density function of the energy of the
linear prediction error:

$$p(\hat{\mathbf{x}}^t \mid k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{E_t^k}{2\sigma^2} \right\}. \tag{18}$$

In continuous density HMM systems, a Viterbi search is usually done, so that
most $\gamma_t[k]$ of equation (15) are zero, and the rest correspond to the mixture weights of
the current state. To decrease computation time, and avoid the search process altogether,
the summation in equation (15) can be approximated with the maximum:

$$\sum_{k=0}^{K-1} \gamma_t[k]\, p(k \mid \hat{\mathbf{x}}^t) \approx \max_{k} p(k \mid \hat{\mathbf{x}}^t) = \arg\max_{k} p(\hat{\mathbf{x}}^t \mid k), \tag{19}$$

where it is assumed that all classes are equally likely:

$$p[k] = \frac{1}{K}, \quad k = 1, 2, \ldots, K. \tag{20}$$

This assumption is based on the insight that only one of the speech-type spectra is likely
the most probable, such that the other spectra can be dismissed.

The reconstruction filters are obtained by inserting equation (19) into equations 
(15) and (13), which turns the problem into a minimization of the LPC error and yields 
an estimate of the reconstruction filters (904 of FIG. 9): 

$$(\hat{h}_1[n], \hat{h}_2[n]) = \arg\min_{h_1[n],\,h_2[n]} \frac{1}{T} \sum_{t=0}^{T-1} \min_{k} E_t^k. \tag{21}$$

The maximization of a negative quantity has been replaced by its minimization, and the 
constant terms have been ignored. Normalization by T is done for ease of comparison 
over different frame sizes. The optimal filters minimize the accumulated prediction error 
with the closest codeword per frame. These filter estimates encapsulate the a priori 
knowledge of the signal x, specifically that the input signal is human speech. 

Formulae can then be derived to solve the minimization equation (21) to obtain 
the actual reconstruction filters (906 of FIG. 9). The autocorrelation of x'[n] can be 
obtained by algebraic manipulation of equation (8): 

$$\begin{aligned}
R^t_{x'x'}[i,j] &= \frac{1}{N} \sum_{n=0}^{N-1} x'_t[n-i] \, x'_t[n-j] \\
&= \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_1[u] h_1[v] \, R^t_{11}[i+u, j+v] \\
&\quad + \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_1[u] h_2[v] \left( R^t_{12}[i+u, j+v] + R^t_{21}[i+v, j+u] \right) \\
&\quad + \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_2[u] h_2[v] \, R^t_{22}[i+u, j+v],
\end{aligned} \tag{22}$$

where the cross-correlation functions have been defined as: 

$$R^t_{ij}[u,v] = \frac{1}{N} \sum_{n=0}^{N-1} y^t_i[n-u] \, y^t_j[n-v]. \tag{23}$$



The autocorrelation of equation (22) has the following symmetry properties: 

$$R^t_{ij}[u,v] = R^t_{ji}[v,u]. \tag{24}$$
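
As a small illustration of equations (23) and (24), the following sketch computes a cross-correlation entry directly from two sample lists and checks the symmetry property numerically. The function name, the frame layout (a frame starting at index `start` inside a longer recording), and the sample values are assumptions chosen only for this example.

```python
def cross_correlation(y_i, y_j, start, n_len, u, v):
    """R_ij[u, v] of eq. (23) for the frame that begins at index `start`.

    Samples that precede the frame are read from the same lists, so `start`
    must be at least max(u, v).
    """
    return sum(
        y_i[start + n - u] * y_j[start + n - v] for n in range(n_len)
    ) / n_len


# Two short, made-up microphone signals.
y1 = [0.0, 1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.0, -0.5]
y2 = [1.0, -1.0, 0.5, 2.0, -0.5, 0.0, 1.0, -1.5, 2.5, 0.5]
start, n_len = 3, 6

r12 = cross_correlation(y1, y2, start, n_len, u=1, v=3)
r21 = cross_correlation(y2, y1, start, n_len, u=3, v=1)  # eq. (24): same value
```

Because of this symmetry, only one of the two cross terms needs to be stored in practice.
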
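Inserting equation (16) into equation (17), and using equation (22), $E_t^k$ can be 
expressed as: 

$$\begin{aligned}
E_t^k &= \frac{1}{N} \sum_{n=0}^{N-1} \left( \sum_{i=0}^{p} a_i^k \, x'_t[n-i] \right)^2 \\
&= \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^k a_j^k \, R^t_{x'x'}[i,j] \\
&= \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_1[u] h_1[v] \left\{ \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^k a_j^k \, R^t_{11}[i+u, j+v] \right\} \\
&\quad + 2 \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_1[u] h_2[v] \left\{ \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^k a_j^k \, R^t_{12}[i+u, j+v] \right\} \\
&\quad + \sum_{u=0}^{q-1} \sum_{v=0}^{q-1} h_2[u] h_2[v] \left\{ \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^k a_j^k \, R^t_{22}[i+u, j+v] \right\}.
\end{aligned} \tag{25}$$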
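
The equivalence between the time-domain definition of equations (16) and (17) and the quadratic form of equation (25) can be checked numerically. The sketch below is an illustration only: the helper names, the filter and LPC values, and the short signals are all made up for the example, and a frame is assumed to start far enough into the recording that every delayed sample exists.

```python
def xcorr(ya, yb, start, n_len, u, v):
    """Cross-correlation R_ab[u, v] of eq. (23) over one frame."""
    return sum(ya[start + n - u] * yb[start + n - v] for n in range(n_len)) / n_len


def energy_direct(y1, y2, h1, h2, a, start, n_len):
    """E via eqs. (16)-(17): build x'[n], apply the LPC vector, average the squares."""
    def x_prime(m):
        return sum(h1[u] * y1[m - u] + h2[u] * y2[m - u] for u in range(len(h1)))

    total = 0.0
    for n in range(start, start + n_len):
        e = sum(a[i] * x_prime(n - i) for i in range(len(a)))
        total += e * e
    return total / n_len


def energy_quadratic(y1, y2, h1, h2, a, start, n_len):
    """E via eq. (25): a quadratic form in h1, h2 built from cross-correlations."""
    q, n_a = len(h1), len(a)

    def weighted(ya, yb, u, v):
        # Inner double sum over i, j of a_i * a_j * R_ab[i + u, j + v].
        return sum(
            a[i] * a[j] * xcorr(ya, yb, start, n_len, i + u, j + v)
            for i in range(n_a) for j in range(n_a)
        )

    total = 0.0
    for u in range(q):
        for v in range(q):
            total += h1[u] * h1[v] * weighted(y1, y1, u, v)
            total += 2.0 * h1[u] * h2[v] * weighted(y1, y2, u, v)
            total += h2[u] * h2[v] * weighted(y2, y2, u, v)
    return total


# Made-up data: q = 2 filter taps, p = 2 LPC order, frame of 8 samples.
y1 = [0.2, -1.0, 0.7, 1.5, -0.3, 0.9, -1.2, 0.4, 1.1, -0.6, 0.0, 0.8]
y2 = [1.0, 0.3, -0.8, 0.1, 0.6, -1.4, 0.5, 1.2, -0.7, 0.2, 0.9, -0.1]
h1, h2 = [1.0, 0.3], [-0.4, 0.2]
a = [1.0, -0.8, 0.1]
start, n_len = 3, 8  # start >= p + q - 1 so no sample index is negative

e_direct = energy_direct(y1, y2, h1, h2, a, start, n_len)
e_quad = energy_quadratic(y1, y2, h1, h2, a, start, n_len)
```

The two values agree up to floating-point rounding. The quadratic form is the more useful one for optimization, since the correlations do not depend on the filters and can be computed once per frame.
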

Inserting equation (25) into equation (21) yields the reconstruction filters. To achieve 
the minimization, an iterative algorithm can be used, such as the known expectation 
maximization (EM) algorithm. Such an algorithm iterates between finding the best 
codebook indices $\hat{k}_t$ and the best reconstruction filters $(h_1[n], h_2[n])$. 

The flowchart 1000 of FIG. 10 outlines the EM algorithm in particular. The algorithm 
starts with initial filters $h_1[n], h_2[n]$ (1002). In the E-step (1004), for t = 0, 1, ..., T-1, 
the best codeword is found: 

$$\hat{k}_t = \arg\min_{k} E_t^k. \tag{26}$$

In the M-step (1006), the $h_1[n], h_2[n]$ are found that minimize the overall energy error: 

$$(\hat{h}_1[n], \hat{h}_2[n]) = \arg\min_{h_1[n],\,h_2[n]} \frac{1}{T} \sum_{t=0}^{T-1} E_t^{\hat{k}_t}. \tag{27}$$

If convergence is reached (1008), then the algorithm is complete (1010). Otherwise, 
another iteration is performed (1004, 1006). Iteration continues until convergence is 
reached. 
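
The alternation between the E-step (26) and the M-step (27) can be sketched on a deliberately reduced problem. The sketch below assumes a single filter tap (q = 1) with $h_1[0]$ fixed to one, so that x'[n] = y1[n] + h2 * y2[n] and the M-step has a closed-form scalar solution. The function names, the codebook, the synthetic signals, and the choice of "codeword assignments stopped changing" as the convergence test are all assumptions made for illustration. Energies are summed rather than averaged, which leaves the minimizer unchanged when all frames have the same length.

```python
import random


def lpc_filter(y, a, n):
    """LPC-filtered sample sum_i a_i * y[n - i] (eq. 16 applied to one channel)."""
    return sum(a_i * y[n - i] for i, a_i in enumerate(a))


def frame_energy(y1, y2, a, h2, start, frame_len):
    """Summed squared LPC error of x'[n] = y1[n] + h2 * y2[n] over one frame."""
    return sum(
        (lpc_filter(y1, a, n) + h2 * lpc_filter(y2, a, n)) ** 2
        for n in range(start, start + frame_len)
    )


def em_single_tap(y1, y2, codebook, frame_len, max_iter=20):
    """Alternate the E-step (26) and M-step (27) for a single free tap h2."""
    p = len(codebook[0]) - 1
    starts = list(range(p, len(y1) - frame_len + 1, frame_len))
    h2, assignment, history = 0.0, None, []
    for _ in range(max_iter):
        # E-step: closest codeword per frame for the current filter.
        new_assignment = [
            min(range(len(codebook)),
                key=lambda k: frame_energy(y1, y2, codebook[k], h2, s, frame_len))
            for s in starts
        ]
        if new_assignment == assignment:
            break  # codeword choices unchanged, so h2 would not change either
        assignment = new_assignment
        # M-step: least-squares h2 given the chosen codewords.
        num = den = 0.0
        for s, k in zip(starts, assignment):
            for n in range(s, s + frame_len):
                z1 = lpc_filter(y1, codebook[k], n)
                z2 = lpc_filter(y2, codebook[k], n)
                num += z1 * z2
                den += z2 * z2
        h2 = -num / den if den > 0.0 else 0.0
        # Record the total energy after this iteration.
        history.append(sum(
            frame_energy(y1, y2, codebook[k], h2, s, frame_len)
            for s, k in zip(starts, assignment)
        ))
    return h2, assignment, history


# Synthetic data: an autoregressive "speech" source and a noise source.
rng = random.Random(0)
speech, noise, s_prev, n_prev = [], [], 0.0, 0.0
for _ in range(401):
    s_prev = 0.9 * s_prev + rng.gauss(0.0, 1.0)
    n_prev = -0.5 * n_prev + rng.gauss(0.0, 1.0)
    speech.append(s_prev)
    noise.append(n_prev)
y1 = [s + 0.6 * v for s, v in zip(speech, noise)]  # microphone 1: speech plus leaked noise
y2 = noise[:]                                       # microphone 2: noise reference only
codebook = [[1.0, -0.9], [1.0, -0.5], [1.0, 0.0]]   # hypothetical codebook
h2, assignment, history = em_single_tap(y1, y2, codebook, frame_len=80)
```

Each step can only lower the total energy or leave it unchanged, so `history` is non-increasing. For this construction the estimate of `h2` would be expected to land near -0.6 when the two sources are uncorrelated, although no particular value is guaranteed for a finite recording.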

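Alternatively, since equation (25) given $\hat{k}_t$ is quadratic in $h_1[n], h_2[n]$, the 
optimal reconstruction filters can be obtained by taking the derivative and equating to 
zero. If all the parameters are free, the trivial solution is $h_1[n] = h_2[n] = 0 \ \forall n$, because 
$\sigma^2$ is not used in equation (18). To avoid this, $h_1[0]$ is set to one, and the remaining 
coefficients are solved for. This results in the following set of 2q-1 linear equations: 

$$\sum_{u=0}^{q-1} h_1[u] \, b_{11}[u,v] + \sum_{u=0}^{q-1} h_2[u] \, b_{21}[u,v] = 0, \quad v = 1, 2, \ldots, q-1 \tag{28}$$

$$\sum_{u=0}^{q-1} h_1[u] \, b_{21}[v,u] + \sum_{u=0}^{q-1} h_2[u] \, b_{22}[u,v] = 0, \quad v = 0, 1, \ldots, q-1, \tag{29}$$

where: 

$$\begin{aligned}
b_{11}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{\hat{k}_t} a_j^{\hat{k}_t} \, R^t_{11}[i+u, j+v] \\
b_{21}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{\hat{k}_t} a_j^{\hat{k}_t} \, R^t_{21}[i+u, j+v] \\
b_{22}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{\hat{k}_t} a_j^{\hat{k}_t} \, R^t_{22}[i+u, j+v].
\end{aligned} \tag{30}$$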

Equations (28) and (29) are easily solved with any commonly available algebra package. 
It is noted that the time index does not start at zero, but rather at $t_0$, because samples of 
$y_1[n], y_2[n]$ are not available for n < 0. 
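
One way to set up and solve such a system is sketched below. It uses the fact that the weighted sum over i and j of $a_i a_j R_{mn}[i+u, j+v]$ is the cross-correlation of the LPC-filtered microphone signals, so the 2q-1 normal equations can be accumulated directly from filtered samples. This is a sketch under stated assumptions rather than the described implementation: the function names, the synthetic signals, the per-frame LPC vectors, and the small Gaussian-elimination helper are all illustrative, frames are assumed to be of equal length, and the system is assumed to be non-singular.

```python
import random


def lpc_filter(y, a, n):
    """z[n] = sum_i a_i * y[n - i], the LPC-filtered microphone sample."""
    return sum(a_i * y[n - i] for i, a_i in enumerate(a))


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def m_step(y1, y2, frames, q):
    """Solve for h1[1..q-1] and h2[0..q-1] with h1[0] fixed to one.

    frames -- list of (start, length, lpc_vector) triples, one per frame
    """
    # Each unknown multiplies one delayed, LPC-filtered microphone signal.
    regressors = [(y1, u) for u in range(1, q)] + [(y2, u) for u in range(q)]
    size = len(regressors)  # 2q - 1 unknowns
    matrix = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for start, length, a in frames:
        for n in range(start, start + length):
            z = [lpc_filter(y, a, n - u) for y, u in regressors]
            z0 = lpc_filter(y1, a, n)  # term multiplied by h1[0] = 1
            for r in range(size):
                rhs[r] -= z[r] * z0
                for c in range(size):
                    matrix[r][c] += z[r] * z[c]
    theta = solve_linear(matrix, rhs)
    return [1.0] + theta[: q - 1], theta[q - 1:]


# Synthetic example: microphone 1 hears speech plus a delayed copy of the noise.
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(205)]
speech, prev = [], 0.0
for _ in range(205):
    prev = 0.9 * prev + rng.gauss(0.0, 1.0)
    speech.append(prev)
y1 = [speech[n] + (0.5 * noise[n - 1] if n > 0 else 0.0) for n in range(205)]
y2 = noise[:]
q = 2
# Frames start at t0 = 5 so that every delayed sample exists.
frames = [(5, 100, [1.0, -0.9]), (105, 100, [1.0, -0.5])]
h1, h2 = m_step(y1, y2, frames, q)
```

Fixing `h1[0]` to one moves the corresponding column to the right-hand side, which is why the system has 2q-1 rather than 2q unknowns.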
Code-Excited Linear Prediction (CELP) Vectors Approach 

In another embodiment, the VQ codebook of LPC vectors (short-term prediction) 
of the previous section of the detailed description is enhanced with pitch prediction (long- 
term prediction), as is done in code-excited linear prediction (CELP). The difference is 
that the error signal in equation (16) is known to be periodic, or quasi-periodic, so that its 
value can be predicted by looking at its value in the past. 

The CELP approach is depicted by reference again to the flowchart 900 of FIG. 9. 
The prediction error of equation (17) is again first defined (902), as: 

$$E_t^k(g_t, \tau_t) = \frac{1}{N} \sum_{n=0}^{N-1} \left| e_t^k[n] - g_t \, e_t^k[n - \tau_t] \right|^2, \tag{31}$$

where the long-term prediction denoted by pitch period $\tau_t$ can be used to predict the 
short-term prediction error by using a gain $g_t$. If the speech is perfectly periodic, the 
gains $g_t$ of equation (31) are one, or substantially close to one. If the speech is at the 
beginning of a vowel, the gain is greater than one, whereas if it is at the end of a vowel 
before a silence, the gain is less than one. If the speech is not periodic, the gain should be 
close to zero. 

Using equation (16), equation (31) can be expanded as: 

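$$E_t^k(g_t, \tau_t) = \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^k a_j^k \left\{ R^t_{x'x'}[i, j] - 2 g_t \, R^t_{x'x'}[i + \tau_t, j] + g_t^2 \, R^t_{x'x'}[i + \tau_t, j + \tau_t] \right\}. \tag{32}$$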

An estimate of the optimal reconstruction filters is obtained by minimizing the 
error (904 of FIG. 9): 

$$(\hat{h}_1[n], \hat{h}_2[n]) = \arg\min_{h_1[n],\,h_2[n]} \frac{1}{T} \sum_{t=0}^{T-1} E_t(\hat{g}_t, \hat{\tau}_t), \tag{33}$$

where: 

$$E_t(\hat{g}_t, \hat{\tau}_t) = \min_{k} \min_{g_t, \tau_t} E_t^k(g_t, \tau_t), \tag{34}$$

and an extra minimization has been introduced over $g_t$ and $\tau_t$. Although the 
minimization should be done jointly with $k_t$, in practice this results in a combinatorial 
explosion. Therefore, a different solution is chosen, to solve the minimization to obtain 
the actual reconstruction filters (906 of FIG. 9). This entails minimization first on $k_t$, and 
then on $g_t$ and $\tau_t$ jointly, as is often done in CELP coders. The search for $\tau_t$ can be done 
within a limited temporal range related to the pitch period of speech signals. 

The EM algorithm can be used to perform the minimization. Again referring to 
FIG. 10, the algorithm starts with initial filters $h_1[n], h_2[n]$ (1002). In the E-step (1004), 
for t = 0, 1, ..., T-1, the best codeword is found: 

$$\hat{k}_t = \arg\min_{k} E_t^k. \tag{35}$$

In the M-step (1006), the $h_1[n], h_2[n]$ are found that minimize the overall energy error: 

$$(\hat{h}_1[n], \hat{h}_2[n]) = \arg\min_{h_1[n],\,h_2[n]} \frac{1}{T} \sum_{t=0}^{T-1} E_t^{\hat{k}_t}(\hat{g}_t, \hat{\tau}_t). \tag{36}$$

If convergence is reached (1008), then the algorithm is complete (1010). Otherwise, 
another iteration is performed (1004, 1006). Iteration continues until convergence is 
reached. 

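Joint minimization of equation (35) can be accomplished by using the optimal g 
for every $\tau$: 

$$g_\tau = \frac{\sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{\hat{k}_t} a_j^{\hat{k}_t} \, R^t_{x'x'}[i + \tau, j]}{\sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{\hat{k}_t} a_j^{\hat{k}_t} \, R^t_{x'x'}[i + \tau, j + \tau]}, \tag{37}$$

and searching for all values of $\tau$ in the allowable pitch range. 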
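
The search over the pitch period can be sketched in the time domain, working directly on the short-term prediction error of equation (16) rather than on correlations. For each candidate lag the gain that minimizes equation (31) is computed in closed form, and the lag with the smallest remaining error is kept. The function name, the lag range, and the impulse-train residual are assumptions chosen for illustration.

```python
def pitch_search(e, start, n_len, tau_min, tau_max):
    """Return (tau, gain, error) minimizing sum (e[n] - g * e[n - tau])^2 over the frame.

    e       -- short-term prediction error samples
    start   -- first sample of the frame (at least tau_max)
    n_len   -- frame length N
    """
    best = None
    for tau in range(tau_min, tau_max + 1):
        frame = range(start, start + n_len)
        num = sum(e[n] * e[n - tau] for n in frame)
        den = sum(e[n - tau] ** 2 for n in frame)
        g = num / den if den > 0.0 else 0.0  # optimal gain for this lag
        err = sum((e[n] - g * e[n - tau]) ** 2 for n in frame)
        if best is None or err < best[2]:
            best = (tau, g, err)
    return best


# Toy short-term residual: an impulse train with a period of 40 samples.
e = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
tau, gain, err = pitch_search(e, start=100, n_len=160, tau_min=20, tau_max=60)
```

For this perfectly periodic residual the search returns a lag of 40 with a gain of one and zero remaining error, in line with the behavior of the gain described for perfectly periodic speech. Restricting the range to 20 through 60 samples keeps multiples of the true period out of the search.
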

Alternatively, solutions of equation (36) given $k_t$, $g_t$, $\tau_t$ can be found by taking the 
derivative of equation (32) and equating it to zero. This leads to another set of 2q-1 
linear equations, as in equations (28) and (29), but where: 



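$$\begin{aligned}
b_{11}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{k_t} a_j^{k_t} \left\{ R^t_{11}[i+u, j+v] - 2 g_t \, R^t_{11}[i+\tau_t+u, j+v] + g_t^2 \, R^t_{11}[i+\tau_t+u, j+\tau_t+v] \right\} \\
b_{21}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{k_t} a_j^{k_t} \left\{ R^t_{21}[i+u, j+v] - 2 g_t \, R^t_{21}[i+\tau_t+u, j+v] + g_t^2 \, R^t_{21}[i+\tau_t+u, j+\tau_t+v] \right\} \\
b_{22}[u,v] &= \sum_{t=t_0}^{T-1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i^{k_t} a_j^{k_t} \left\{ R^t_{22}[i+u, j+v] - 2 g_t \, R^t_{22}[i+\tau_t+u, j+v] + g_t^2 \, R^t_{22}[i+\tau_t+u, j+\tau_t+v] \right\}
\end{aligned} \tag{38}$$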



Example Computerized Device 

FIG. 11 illustrates an example of a suitable computing system environment 10 in 
which the invention may be implemented. For example, the environment 10 may be the 
environment in which the inventive sound source separation is performed, and/or the 
environment in which the inventive unmixing filters are constructed. The computing 
system environment 10 is only one example of a suitable computing environment and is 
not intended to suggest any limitation as to the scope of use or functionality of the 
invention. Neither should the computing environment 10 be interpreted as having any 
dependency or requirement relating to any one or combination of components illustrated 
in the exemplary operating environment 10. 

The invention is operational with numerous other general purpose or special 
purpose computing system environments or configurations. Examples of well known 
computing systems, environments, and/or configurations that may be suitable for use 
with the invention include, but are not limited to, personal computers, server computers, 
hand-held or laptop devices, multiprocessor systems, and microprocessor-based systems. 
Additional examples include set top boxes, programmable consumer electronics, network 
PCs, minicomputers, mainframe computers, distributed computing environments that 
include any of the above systems or devices, and the like. 

The invention may be described in the general context of computer-executable 
instructions, such as program modules, being executed by a computer. Generally, 
program modules include routines, programs, objects, components, data structures, etc. 
that perform particular tasks or implement particular abstract data types. The invention 
may also be practiced in distributed computing environments where tasks are performed 
by remote processing devices that are linked through a communications network. In a 
distributed computing environment, program modules may be located in both local and 
remote computer storage media including memory storage devices. 

An exemplary system for implementing the invention includes a computing 
device, such as computing device 10. In its most basic configuration, computing device 
10 typically includes at least one processing unit 12 and memory 14. Depending on the 
exact configuration and type of computing device, memory 14 may be volatile (such as 
RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. 
This most basic configuration is illustrated by dashed line 16. Additionally, device 10 
may also have additional features/functionality. For example, device 10 may also include 
additional storage (removable and/or non-removable) including, but not limited to, 
magnetic or optical disks or tape. Such additional storage is illustrated by removable 
storage 18 and non-removable storage 20. 

Computer storage media includes volatile, nonvolatile, removable, and non-removable 
media implemented in any method or technology for storage of information 
such as computer readable instructions, data structures, program modules, or other data. 
Memory 14, removable storage 18, and non-removable storage 20 are all examples of 
computer storage media. Computer storage media includes, but is not limited to, RAM, 
ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile 
disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk 
storage or other magnetic storage devices, or any other medium which can be used to 
store the desired information and which can be accessed by device 10. Any such computer 
storage media may be part of device 10. 

Device 10 may also contain communications connection(s) 22 that allow the 
device to communicate with other devices. Communications connection(s) 22 is an 
example of communication media. Communication media typically embodies computer 
readable instructions, data structures, program modules, or other data in a modulated data 
signal such as a carrier wave or other transport mechanism and includes any information 
delivery media. The term "modulated data signal" means a signal that has one or more of 
its characteristics set or changed in such a manner as to encode information in the signal. 
By way of example, and not limitation, communication media includes wired media such 
as a wired network or direct-wired connection, and wireless media such as acoustic, RF, 
infrared and other wireless media. The term computer readable media as used herein 
includes both storage media and communication media. 
Device 10 may also have input device(s) 24 such as keyboard, mouse, pen, sound 
input device (such as a microphone), touch input device, etc. Output device(s) 26 such as 
a display, speakers, printer, etc. may also be included. All these devices are well known 
in the art and need not be discussed at length here. 

The approaches that have been described can be computer-implemented methods 
on the device 10. A computer-implemented method is desirably realized at least in part 
as one or more programs running on a computer. The programs can be executed from a 
computer-readable medium such as a memory by a processor of a computer. The 
programs are desirably storable on a machine-readable medium, such as a floppy disk or 
a CD-ROM, for distribution and installation and execution on another computer. The 
program or programs can be a part of a computer system, a computer, or a computerized 
device. 

Conclusion 

It is noted that, although specific embodiments have been illustrated and 
described herein, it will be appreciated by those of ordinary skill in the art that any 
arrangement that is calculated to achieve the same purpose may be substituted for the specific 
embodiments shown. This application is intended to cover any adaptations or variations 
of the present invention. Therefore, it is manifestly intended that this invention be 
limited only by the claims and equivalents thereof. 